The subject of this dataset is crowdfunding: a form of financing in which many different investors contribute varying amounts of money to specific projects, i.e. earmarked funding. The data is based on the history of a crowdfunding platform. All projects are completed, meaning their crowdfunding period has expired, and the source record contains no duplicates.
The dataset is split across two CSV files, whose columns (and their meaning) are examined below:
# import the required libraries
import pandas as pd
import numpy as np
import plotly_express as px
# The following libraries are required only for converting plotly express plots to HTML;
# otherwise, when the Jupyter notebook is downloaded as an HTML file, the plotly express plots don't appear
import plotly.offline as py
py.init_notebook_mode(connected=False)
# setting to avoid unnecessary chain assignment warnings
pd.options.mode.chained_assignment = None
# find out the separator of the first csv file
# The following ParserError appears when "/n" is used as the separator:
# ParserError: Error could possibly be due to quotes being ignored when a multi-char delimiter is used.
# df1 = pd.read_csv(filepath_or_buffer="part1.csv", sep="/n", engine="python")
# "//n" never occurs in the file, so each line is read into a single column,
# which makes the real separator visible in the output below
df1 = pd.read_csv(filepath_or_buffer="part1.csv", sep="//n", engine="python")
df1.head()
| ,funded_amount,loan_amount,activity,sector,use,country_code,country,region,currency,term_in_months,lender_count,borrower_genders,repayment_interval | |
|---|---|
| 0 | 0,300.0,300.0,Fruits & Vegetables,Food,"To buy... |
| 1 | 1,575.0,575.0,Rickshaw,Transportation,to repai... |
| 2 | 2,150.0,150.0,Transportation,Transportation,To... |
| 3 | 3,200.0,200.0,Embroidery,Arts,to purchase an e... |
| 4 | 4,400.0,400.0,Milk Sales,Food,to purchase one ... |
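Instead of guessing, the delimiter can often be detected automatically with the standard library's `csv.Sniffer`. A minimal sketch on a made-up sample (the column names are illustrative, not read from the actual file):

```python
import csv

# Sniffer inspects a text sample and proposes a dialect, including the delimiter.
# Restricting the candidate delimiters makes the detection more reliable.
sample = "id,funded_amount,loan_amount\n0,300.0,300.0\n1,575.0,575.0\n"
dialect = csv.Sniffer().sniff(sample, delimiters=",#;")
print(dialect.delimiter)  # ","
```

In practice one would read the first few lines of `part1.csv` into `sample` before sniffing.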
# "," is the separator
# now read the csv file again with the correct separator into the DataFrame
# use index_col=0 to avoid an extra "Unnamed: 0" column
df_part1 = pd.read_csv(
    filepath_or_buffer="part1.csv",
    sep=',',
    engine='python',  # engine='python', otherwise a warning is raised
    index_col=0)
df_part1.head()
| funded_amount | loan_amount | activity | sector | use | country_code | country | region | currency | term_in_months | lender_count | borrower_genders | repayment_interval | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 300.0 | 300.0 | Fruits & Vegetables | Food | To buy seasonal, fresh fruits to sell. | PK | Pakistan | Lahore | PKR | 12.0 | 12 | female | irregular |
| 1 | 575.0 | 575.0 | Rickshaw | Transportation | to repair and maintain the auto rickshaw used ... | PK | Pakistan | Lahore | PKR | 11.0 | 14 | female, female | irregular |
| 2 | 150.0 | 150.0 | Transportation | Transportation | To repair their old cycle-van and buy another ... | IN | India | Maynaguri | INR | 43.0 | 6 | female | bullet |
| 3 | 200.0 | 200.0 | Embroidery | Arts | to purchase an embroidery machine and a variet... | PK | Pakistan | Lahore | PKR | 11.0 | 8 | female | irregular |
| 4 | 400.0 | 400.0 | Milk Sales | Food | to purchase one buffalo. | PK | Pakistan | Abdul Hakeem | PKR | 14.0 | 16 | female | monthly |
# find out the separator of the second csv file
# The following ParserError appears when "/n" is used as the separator:
# ParserError: Error could possibly be due to quotes being ignored when a multi-char delimiter is used.
# df1 = pd.read_csv(filepath_or_buffer="part2.csv", sep="/n", engine="python")
# again use "//n", which never occurs in the file, to inspect the real separator
df2 = pd.read_csv(filepath_or_buffer="part2.csv", sep="//n", engine="python")
df2.head()
| #funded_amount#loan_amount#activity#sector#use#country_code#country#region#currency#term_in_months#lender_count#borrower_genders#repayment_interval | |
|---|---|
| 0 | 0#175.0#175.0#Liquor Store / Off-License#Food#... |
| 1 | 1#325.0#325.0#Livestock#Agriculture#to buy 3 z... |
| 2 | 2#550.0#550.0#Food Stall#Food#to buy ingredien... |
| 3 | 3#1300.0#1300.0#Cattle#Agriculture#to buy one ... |
| 4 | 4#900.0#900.0#Consumer Goods#Personal Use#to b... |
# "#" is the separator
# now read the csv file again with the correct separator into the DataFrame
# use index_col=0 to avoid an extra "Unnamed: 0" column
df_part2 = pd.read_csv(filepath_or_buffer="part2.csv",
                       sep='#',
                       engine='python',  # engine='python', otherwise a warning is raised
                       index_col=0)
df_part2.head()
| funded_amount | loan_amount | activity | sector | use | country_code | country | region | currency | term_in_months | lender_count | borrower_genders | repayment_interval | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 175.0 | 175.0 | Liquor Store / Off-License | Food | to purchase additional stock of coconut wine t... | PH | Philippines | Palo, Leyte | PHP | 8.0 | 6 | female | irregular |
| 1 | 325.0 | 325.0 | Livestock | Agriculture | to buy 3 zebus and food to fatten them up. | MG | Madagascar | Antsirabe | MGA | 12.0 | 13 | female | monthly |
| 2 | 550.0 | 550.0 | Food Stall | Food | to buy ingredients for her food-vending busine... | PH | Philippines | Cordova, Cebu | PHP | 5.0 | 6 | female | irregular |
| 3 | 1300.0 | 1300.0 | Cattle | Agriculture | to buy one head of cattle. | EG | Egypt | Baniswef | EGP | 14.0 | 50 | male | monthly |
| 4 | 900.0 | 900.0 | Consumer Goods | Personal Use | to buy consumer goods amongst others. | PE | Peru | Urubamba - Urubamba - Cusco | PEN | 6.0 | 1 | female | irregular |
# check the columns present in the first dataframe
df_part1.columns
Index(['funded_amount', 'loan_amount', 'activity', 'sector', 'use',
'country_code', 'country', 'region', 'currency', 'term_in_months',
'lender_count', 'borrower_genders', 'repayment_interval'],
dtype='object')
# check how many rows and columns are present in the first dataframe
df_part1.shape
(335000, 13)
# check the columns present in the second dataframe
df_part2.columns
Index(['funded_amount', 'loan_amount', 'activity', 'sector', 'use',
'country_code', 'country', 'region', 'currency', 'term_in_months',
'lender_count', 'borrower_genders', 'repayment_interval'],
dtype='object')
# check how many rows and columns are present in the second dataframe
df_part2.shape
(336205, 13)
# combine the rows of the two dataframes, as both have the same columns
# with ignore_index=True, the concatenation creates a new integer index
df_part3 = pd.concat([df_part1, df_part2], ignore_index=True)
df_part3.head()
| funded_amount | loan_amount | activity | sector | use | country_code | country | region | currency | term_in_months | lender_count | borrower_genders | repayment_interval | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 300.0 | 300.0 | Fruits & Vegetables | Food | To buy seasonal, fresh fruits to sell. | PK | Pakistan | Lahore | PKR | 12.0 | 12 | female | irregular |
| 1 | 575.0 | 575.0 | Rickshaw | Transportation | to repair and maintain the auto rickshaw used ... | PK | Pakistan | Lahore | PKR | 11.0 | 14 | female, female | irregular |
| 2 | 150.0 | 150.0 | Transportation | Transportation | To repair their old cycle-van and buy another ... | IN | India | Maynaguri | INR | 43.0 | 6 | female | bullet |
| 3 | 200.0 | 200.0 | Embroidery | Arts | to purchase an embroidery machine and a variet... | PK | Pakistan | Lahore | PKR | 11.0 | 8 | female | irregular |
| 4 | 400.0 | 400.0 | Milk Sales | Food | to purchase one buffalo. | PK | Pakistan | Abdul Hakeem | PKR | 14.0 | 16 | female | monthly |
# check how many rows and columns are present in the dataframe obtained after concatenating the two dataframes
df_part3.shape
(671205, 13)
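A tiny illustration of why `ignore_index=True` matters here: without it, the result keeps both original integer indices, so index labels repeat.

```python
import pandas as pd

# Two toy frames with default RangeIndex 0..1 each.
a = pd.DataFrame({"x": [1, 2]})
b = pd.DataFrame({"x": [3, 4]})

kept = pd.concat([a, b])                      # keeps both indices
fresh = pd.concat([a, b], ignore_index=True)  # builds a new 0..3 index

print(kept.index.tolist())   # [0, 1, 0, 1]
print(fresh.index.tolist())  # [0, 1, 2, 3]
```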
# find if any duplicated data is present
duplicated_rows = df_part3[df_part3.duplicated()]
duplicated_rows.head()
| funded_amount | loan_amount | activity | sector | use | country_code | country | region | currency | term_in_months | lender_count | borrower_genders | repayment_interval | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 498 | 100.0 | 100.0 | Home Energy | Personal Use | to buy a solar-powered lamp. | SV | El Salvador | NaN | USD | 14.0 | 4 | male | monthly |
| 606 | 100.0 | 100.0 | Home Energy | Personal Use | to buy a solar-powered lamp. | SV | El Salvador | NaN | USD | 14.0 | 4 | male | monthly |
| 808 | 450.0 | 450.0 | Higher education costs | Education | to pay for one semester's registration fees. | CO | Colombia | Bogotà | COP | 7.0 | 15 | female | monthly |
| 1703 | 500.0 | 500.0 | Higher education costs | Education | To buy a laptop for educational purposes. | SO | Somalia | Hargeisa | USD | 8.0 | 19 | male | monthly |
| 2317 | 250.0 | 250.0 | Poultry | Agriculture | to purchase poultry. | KE | Kenya | Ndaragwa | KES | 16.0 | 10 | female | monthly |
# drop the duplicate rows from the concatenated dataframe
df_part4 = df_part3.drop_duplicates(subset=df_part3.columns)
df_part4.head()
| funded_amount | loan_amount | activity | sector | use | country_code | country | region | currency | term_in_months | lender_count | borrower_genders | repayment_interval | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 300.0 | 300.0 | Fruits & Vegetables | Food | To buy seasonal, fresh fruits to sell. | PK | Pakistan | Lahore | PKR | 12.0 | 12 | female | irregular |
| 1 | 575.0 | 575.0 | Rickshaw | Transportation | to repair and maintain the auto rickshaw used ... | PK | Pakistan | Lahore | PKR | 11.0 | 14 | female, female | irregular |
| 2 | 150.0 | 150.0 | Transportation | Transportation | To repair their old cycle-van and buy another ... | IN | India | Maynaguri | INR | 43.0 | 6 | female | bullet |
| 3 | 200.0 | 200.0 | Embroidery | Arts | to purchase an embroidery machine and a variet... | PK | Pakistan | Lahore | PKR | 11.0 | 8 | female | irregular |
| 4 | 400.0 | 400.0 | Milk Sales | Food | to purchase one buffalo. | PK | Pakistan | Abdul Hakeem | PKR | 14.0 | 16 | female | monthly |
# check how many rows and columns are present in the dataframe obtained after dropping the duplicate rows
df_part4.shape
(646833, 13)
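For reference, `drop_duplicates` keeps the first occurrence of each duplicated row by default; `keep=False` would instead discard every copy. A toy example:

```python
import pandas as pd

# Rows 0 and 1 are identical; row 2 is unique.
df = pd.DataFrame({"a": [1, 1, 2], "b": ["x", "x", "y"]})

print(df.duplicated().sum())                 # 1  (only the second copy counts)
print(df.drop_duplicates().shape)            # (2, 2)  first copy is kept
print(df.drop_duplicates(keep=False).shape)  # (1, 2)  all copies dropped
```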
# get an overview of the data types of the columns,
# the missing values present, and the memory usage
df_part4.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 646833 entries, 0 to 671188
Data columns (total 13 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   funded_amount       646833 non-null  float64
 1   loan_amount         646833 non-null  float64
 2   activity            646833 non-null  object 
 3   sector              646833 non-null  object 
 4   use                 642934 non-null  object 
 5   country_code        646825 non-null  object 
 6   country             646833 non-null  object 
 7   region              590670 non-null  object 
 8   currency            646833 non-null  object 
 9   term_in_months      646833 non-null  float64
 10  lender_count        646833 non-null  int64  
 11  borrower_genders    642945 non-null  object 
 12  repayment_interval  646833 non-null  object 
dtypes: float64(3), int64(1), object(9)
memory usage: 69.1+ MB
# use describe to get the statistical summary
df_part4.describe(include='all')
| funded_amount | loan_amount | activity | sector | use | country_code | country | region | currency | term_in_months | lender_count | borrower_genders | repayment_interval | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 646833.000000 | 646833.000000 | 646833 | 646833 | 642934 | 646825 | 646833 | 590670 | 646833 | 646833.000000 | 646833.000000 | 642945 | 646833 |
| unique | NaN | NaN | 163 | 15 | 424912 | 86 | 87 | 12695 | 67 | NaN | NaN | 11298 | 4 |
| top | NaN | NaN | Farming | Agriculture | to buy a water filter to provide safe drinking... | PH | Philippines | Lahore | PHP | NaN | NaN | female | monthly |
| freq | NaN | NaN | 69497 | 174624 | 2074 | 154323 | 154323 | 7168 | 154322 | NaN | NaN | 415237 | 330448 |
| mean | 807.845170 | 866.239238 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 13.718920 | 21.110709 | NaN | NaN |
| std | 1145.283451 | 1214.135953 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 8.479172 | 28.837152 | NaN | NaN |
| min | 0.000000 | 25.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.000000 | 0.000000 | NaN | NaN |
| 25% | 275.000000 | 275.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 8.000000 | 7.000000 | NaN | NaN |
| 50% | 475.000000 | 500.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 13.000000 | 13.000000 | NaN | NaN |
| 75% | 925.000000 | 1000.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 14.000000 | 25.000000 | NaN | NaN |
| max | 100000.000000 | 100000.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 158.000000 | 2986.000000 | NaN | NaN |
# Identify issues via data type: compare each column's data type with its content
# and check the categorical columns for synonyms
df_part4.dtypes
funded_amount         float64
loan_amount           float64
activity               object
sector                 object
use                    object
country_code           object
country                object
region                 object
currency               object
term_in_months        float64
lender_count            int64
borrower_genders       object
repayment_interval     object
dtype: object
# converting these columns to a numeric datatype also does not raise any error
list1 = ['funded_amount', 'loan_amount', 'term_in_months']
for i in list1:
    pd.to_numeric(df_part4[i], errors='raise')
# check whether the column "activity" contains a synonym
df_part4.activity.unique()
array(['Fruits & Vegetables', 'Rickshaw', 'Transportation', 'Embroidery',
'Milk Sales', 'Services', 'Dairy', 'Beauty Salon', 'Manufacturing',
'Food Production/Sales', 'Wholesale', 'General Store',
'Clothing Sales', 'Poultry', 'Tailoring', 'Sewing', 'Bakery',
'Restaurant', 'Food Stall', 'Farming', 'Construction Supplies',
'Personal Products Sales', 'Home Products Sales',
'Natural Medicines', 'Fish Selling', 'Education provider',
'Shoe Sales', 'Machinery Rental', 'Butcher Shop', 'Pigs',
'Personal Expenses', 'Food Market', 'Cosmetics Sales',
'Personal Housing Expenses', 'Retail', 'Energy', 'Grocery Store',
'Construction', 'Agriculture', 'Motorcycle Transport',
'Charcoal Sales', 'Food', 'Pharmacy', 'Fishing', 'Timber Sales',
'Cattle', 'Electronics Repair', 'Electronics Sales', 'Vehicle',
'Cafe', 'Blacksmith', 'Higher education costs', 'Used Clothing',
'Fuel/Firewood', 'Upholstery', 'Catering', 'Animal Sales',
'Cereals', 'Vehicle Repairs', 'Arts',
'Cloth & Dressmaking Supplies', 'Mobile Phones', 'Spare Parts',
'Clothing', 'Metal Shop', 'Barber Shop', 'Furniture Making',
'Crafts', 'Home Energy', 'Home Appliances', 'Wedding Expenses',
'Taxi', 'Secretarial Services', 'Livestock', 'Property',
'Recycling', 'Farm Supplies', 'Auto Repair', 'Beverages',
'Plastics Sales', 'Electrical Goods', 'Carpentry', 'Photography',
'Jewelry', 'Bricks', 'Pub', 'Phone Use Sales',
'Water Distribution', 'Paper Sales', 'Computers',
'Liquor Store / Off-License', 'Utilities', 'Knitting', 'Weaving',
'Party Supplies', 'Medical Clinic', 'Internet Cafe',
'Consumer Goods', 'Cement', 'Electrician',
'Primary/secondary school costs', 'Veterinary Sales',
'Land Rental', 'Laundry', 'Call Center', 'Perfumes', 'Hotel',
'Motorcycle Repair', 'Movie Tapes & DVDs', 'Quarrying',
'Personal Medical Expenses', 'Bookstore', 'Decorations Sales',
'Recycled Materials', 'Office Supplies', 'Souvenir Sales',
'Renewable Energy Products', 'Health', 'Printing', 'Phone Repair',
'Traveling Sales', 'Flowers', 'Bicycle Repair', 'Entertainment',
'Phone Accessories', 'Hardware', 'Used Shoes',
'Music Discs & Tapes', 'Games', 'Balut-Making', 'Textiles',
'Child Care', 'Goods Distribution', 'Florist', 'Cobbler', 'Dental',
'Bookbinding', 'Cheese Making', 'Bicycle Sales', 'Well digging',
'Technology', 'Musical Performance', 'Waste Management', 'Film',
'Tourism', 'Musical Instruments', 'Religious Articles',
'Machine Shop', 'Cleaning Services', 'Sporting Good Sales',
'Patchwork', 'Funerals', 'Air Conditioning', 'Communications',
'Adult Care', 'Landscaping / Gardening', 'Aquaculture',
'Beekeeping', 'Event Planning', 'Celebrations', 'Computer',
'Personal Care Products', 'Mobile Transactions'], dtype=object)
# check whether the column "sector" contains a synonym
df_part4.sector.unique()
array(['Food', 'Transportation', 'Arts', 'Services', 'Agriculture',
'Manufacturing', 'Wholesale', 'Retail', 'Clothing', 'Construction',
'Health', 'Education', 'Personal Use', 'Housing', 'Entertainment'],
dtype=object)
# check whether the column "use" contains a synonym
df_part4.use.unique()
array(['To buy seasonal, fresh fruits to sell. ',
'to repair and maintain the auto rickshaw used in their business.',
'To repair their old cycle-van and buy another one to rent out as a source of income',
...,
'Pretend the issue with loan got addressed by Kiva Coordinator.',
'Kiva Coordinator replaced loan use. Should see this in viewdiff.',
'Edited loan use in english.'], dtype=object)
# check whether the column "country_code" contains a synonym
df_part4.country_code.unique()
array(['PK', 'IN', 'KE', 'NI', 'SV', 'TZ', 'PH', 'PE', 'SN', 'KH', 'LR',
'VN', 'IQ', 'HN', 'PS', 'MN', 'US', 'ML', 'CO', 'TJ', 'GT', 'EC',
'BO', 'YE', 'GH', 'SL', 'HT', 'CL', 'JO', 'UG', 'BI', 'BF', 'TL',
'ID', 'GE', 'UA', 'XK', 'AL', 'CD', 'CR', 'SO', 'ZW', 'CM', 'TR',
'AZ', 'DO', 'BR', 'MX', 'KG', 'AM', 'PY', 'LB', 'WS', 'IL', 'RW',
'ZM', 'NP', 'CG', 'MZ', 'ZA', 'TG', 'BJ', 'BZ', 'SR', 'TH', 'NG',
'MR', 'VU', 'PA', 'VI', 'VC', 'LA', 'MW', 'MM', 'MD', 'SS', 'SB',
'CN', 'EG', 'GU', 'AF', 'MG', nan, 'PR', 'LS', 'CI', 'BT'],
dtype=object)
# check whether the column "country" contains a synonym
df_part4.country.unique()
array(['Pakistan', 'India', 'Kenya', 'Nicaragua', 'El Salvador',
'Tanzania', 'Philippines', 'Peru', 'Senegal', 'Cambodia',
'Liberia', 'Vietnam', 'Iraq', 'Honduras', 'Palestine', 'Mongolia',
'United States', 'Mali', 'Colombia', 'Tajikistan', 'Guatemala',
'Ecuador', 'Bolivia', 'Yemen', 'Ghana', 'Sierra Leone', 'Haiti',
'Chile', 'Jordan', 'Uganda', 'Burundi', 'Burkina Faso',
'Timor-Leste', 'Indonesia', 'Georgia', 'Ukraine', 'Kosovo',
'Albania', 'The Democratic Republic of the Congo', 'Costa Rica',
'Somalia', 'Zimbabwe', 'Cameroon', 'Turkey', 'Azerbaijan',
'Dominican Republic', 'Brazil', 'Mexico', 'Kyrgyzstan', 'Armenia',
'Paraguay', 'Lebanon', 'Samoa', 'Israel', 'Rwanda', 'Zambia',
'Nepal', 'Congo', 'Mozambique', 'South Africa', 'Togo', 'Benin',
'Belize', 'Suriname', 'Thailand', 'Nigeria', 'Mauritania',
'Vanuatu', 'Panama', 'Virgin Islands',
'Saint Vincent and the Grenadines',
"Lao People's Democratic Republic", 'Malawi', 'Myanmar (Burma)',
'Moldova', 'South Sudan', 'Solomon Islands', 'China', 'Egypt',
'Guam', 'Afghanistan', 'Madagascar', 'Namibia', 'Puerto Rico',
'Lesotho', "Cote D'Ivoire", 'Bhutan'], dtype=object)
# check whether the column "region" contains a synonym
df_part4.region.unique()
array(['Lahore', 'Maynaguri', 'Abdul Hakeem', ..., 'Gbenikoro Village',
'Morimaraia', 'alejandria'], dtype=object)
# check whether the column "currency" contains a synonym
df_part4.currency.unique()
array(['PKR', 'INR', 'KES', 'NIO', 'USD', 'TZS', 'PHP', 'PEN', 'XOF',
'LRD', 'VND', 'HNL', 'MNT', 'COP', 'GTQ', 'TJS', 'BOB', 'YER',
'KHR', 'GHS', 'SLL', 'HTG', 'CLP', 'JOD', 'UGX', 'BIF', 'IDR',
'GEL', 'UAH', 'EUR', 'ALL', 'CRC', 'XAF', 'TRY', 'AZN', 'DOP',
'BRL', 'MXN', 'KGS', 'AMD', 'PYG', 'LBP', 'WST', 'ILS', 'RWF',
'ZMW', 'NPR', 'MZN', 'ZAR', 'BZD', 'SRD', 'NGN', 'VUV', 'XCD',
'MWK', 'LAK', 'MMK', 'ZWD', 'MDL', 'SSP', 'SBD', 'CNY', 'EGP',
'MGA', 'NAD', 'LSL', 'THB'], dtype=object)
# check whether the column "borrower_genders" contains a synonym
df_part4.borrower_genders.unique()
array(['female', 'female, female', 'female, female, female', ...,
'female, female, male, female, female, female, female, female, female, female, male, male, female, female, male, female, female, female, female, female, female, female',
'male, female, female, female, female, female, female, female, male, male, female, male, female, male, male, male',
'female, female, female, male, female, female, female, male, female, female, female, male, female, male, female, female, female, female, female, female, female, female, female, female, female, female, female, female, male'],
dtype=object)
# check whether the column "repayment_interval" contains a synonym
df_part4.repayment_interval.unique()
array(['irregular', 'bullet', 'monthly', 'weekly'], dtype=object)
# check the total count of the NA values in each column
df_part4.isnull().sum()
funded_amount             0
loan_amount               0
activity                  0
sector                    0
use                    3899
country_code              8
country                   0
region                56163
currency                  0
term_in_months            0
lender_count              0
borrower_genders       3888
repayment_interval        0
dtype: int64
# check for which country the country_code is NA
df_part4.loc[df_part4.country_code.isnull()]
| funded_amount | loan_amount | activity | sector | use | country_code | country | region | currency | term_in_months | lender_count | borrower_genders | repayment_interval | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 202537 | 4150.0 | 4150.0 | Wholesale | Wholesale | To purchase lighting products for sale to loca... | NaN | Namibia | EEnhana | NAD | 6.0 | 162 | female | bullet |
| 202823 | 4150.0 | 4150.0 | Wholesale | Wholesale | To purchase lighting products for sale to loca... | NaN | Namibia | Rundu | NAD | 6.0 | 159 | male | bullet |
| 344929 | 3325.0 | 3325.0 | Wholesale | Wholesale | To purchase lighting products for sale to loca... | NaN | Namibia | EEnhana | NAD | 7.0 | 120 | female | bullet |
| 351177 | 3325.0 | 3325.0 | Wholesale | Wholesale | To purchase lighting products for sale to loca... | NaN | Namibia | Rundu | NAD | 7.0 | 126 | male | bullet |
| 420953 | 3325.0 | 3325.0 | Wholesale | Wholesale | To purchase lighting products for sale to loca... | NaN | Namibia | EEnhana | NAD | 7.0 | 118 | female | bullet |
| 421218 | 4000.0 | 4000.0 | Wholesale | Wholesale | purchase solar lighting products for sale to l... | NaN | Namibia | Rundu | NAD | 7.0 | 150 | male | bullet |
| 487207 | 5100.0 | 5100.0 | Renewable Energy Products | Retail | to pay for stock of solar lights and cell phon... | NaN | Namibia | Katima Mulilo | NAD | 7.0 | 183 | male | bullet |
| 487653 | 5000.0 | 5000.0 | Wholesale | Wholesale | to maintain a stock of solar lights and cell p... | NaN | Namibia | Oshakati | NAD | 7.0 | 183 | female | bullet |
# Namibia's ISO country code is the literal string 'NA'; assign it to the
# country_code column where the country is Namibia
df_part4.loc[df_part4.country_code.isnull() & (df_part4.country == 'Namibia'),
             ['country_code']] = 'NA'
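A likely root cause worth knowing (an assumption, not verified against the raw files): `read_csv` treats the literal string "NA" as a missing value by default, so Namibia's ISO code becomes NaN at read time. `keep_default_na=False` preserves it:

```python
import io
import pandas as pd

# Illustrative two-row csv; "NA" is Namibia's genuine ISO alpha-2 code.
csv_text = "country_code,country\nNA,Namibia\nPK,Pakistan\n"

# Default behaviour: "NA" is interpreted as a missing value.
df_default = pd.read_csv(io.StringIO(csv_text))
print(df_default.country_code.isnull().sum())  # 1

# keep_default_na=False keeps "NA" as an ordinary string.
df_raw = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)
print(df_raw.country_code.tolist())  # ['NA', 'PK']
```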
# check the dataframe for the remaining NA values
df_part4.isnull().sum()
funded_amount             0
loan_amount               0
activity                  0
sector                    0
use                    3899
country_code              0
country                   0
region                56163
currency                  0
term_in_months            0
lender_count              0
borrower_genders       3888
repayment_interval        0
dtype: int64
# The null values in column "use" make up only about 0.6% of the total values. These can be replaced with empty strings.
df_part4.loc[df_part4.use.isnull(), ['use']] = ''
# check the dataframe for the remaining NA values
df_part4.isnull().sum()
funded_amount             0
loan_amount               0
activity                  0
sector                    0
use                       0
country_code              0
country                   0
region                56163
currency                  0
term_in_months            0
lender_count              0
borrower_genders       3888
repayment_interval        0
dtype: int64
# check for which countries the region is null, sorted in descending order of count
for i in df_part4.loc[df_part4.region.isnull(), ['country']].value_counts().index:
    try:
        # replace NA values with the most frequently occurring region for the specific country
        str1 = df_part4.loc[df_part4.country == i[0], 'region'].value_counts().idxmax()
        df_part4.loc[df_part4.region.isnull() & (df_part4.country == i[0]),
                     ['region']] = str1
    except Exception:
        # in case of any error (e.g. a country with no known region at all), replace with an empty string
        df_part4.loc[df_part4.region.isnull() & (df_part4.country == i[0]),
                     ['region']] = ""
df_part4.isnull().sum()
funded_amount             0
loan_amount               0
activity                  0
sector                    0
use                       0
country_code              0
country                   0
region                    0
currency                  0
term_in_months            0
lender_count              0
borrower_genders       3888
repayment_interval        0
dtype: int64
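The per-country loop above can also be written as a single groupby/transform; a sketch on toy data (the country and region values here are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["PK", "PK", "PK", "IN"],
    "region": ["Lahore", None, "Lahore", None],
})

# For each country, fill missing regions with that country's most frequent
# region; fall back to an empty string when no region is known at all.
df["region"] = df.groupby("country")["region"].transform(
    lambda s: s.fillna(s.mode().iloc[0]) if not s.mode().empty else s.fillna("")
)
print(df.region.tolist())  # ['Lahore', 'Lahore', 'Lahore', '']
```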
# Based on the column "borrower_genders", create a new column "borrower_count"
# containing the total number of borrowers
male_list = []
female_list = []
for i in df_part4.borrower_genders:
    try:
        str1 = i.split(", ")
        male_list.append(str1.count('male'))
        female_list.append(str1.count('female'))
    except AttributeError:
        # NaN values have no split() method; append NaN counts instead
        male_list.append(np.nan)
        female_list.append(np.nan)
df_female = pd.DataFrame(female_list)
df_male = pd.DataFrame(male_list)
# assign the summed counts positionally via .values: df_part4's index has gaps
# after dropping duplicates and would otherwise misalign with the new 0..n index
df_part4['borrower_count'] = (df_female + df_male)[0].values
# The missing values in column "borrower_count" can be replaced with 1,
# as there is at least 1 borrower in all these cases
df_part4.loc[df_part4.borrower_count.isnull(), ['borrower_count']] = 1
df_part4.head()
| funded_amount | loan_amount | activity | sector | use | country_code | country | region | currency | term_in_months | lender_count | borrower_genders | repayment_interval | borrower_count | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 300.0 | 300.0 | Fruits & Vegetables | Food | To buy seasonal, fresh fruits to sell. | PK | Pakistan | Lahore | PKR | 12.0 | 12 | female | irregular | 1.0 |
| 1 | 575.0 | 575.0 | Rickshaw | Transportation | to repair and maintain the auto rickshaw used ... | PK | Pakistan | Lahore | PKR | 11.0 | 14 | female, female | irregular | 2.0 |
| 2 | 150.0 | 150.0 | Transportation | Transportation | To repair their old cycle-van and buy another ... | IN | India | Maynaguri | INR | 43.0 | 6 | female | bullet | 1.0 |
| 3 | 200.0 | 200.0 | Embroidery | Arts | to purchase an embroidery machine and a variet... | PK | Pakistan | Lahore | PKR | 11.0 | 8 | female | irregular | 1.0 |
| 4 | 400.0 | 400.0 | Milk Sales | Food | to purchase one buffalo. | PK | Pakistan | Abdul Hakeem | PKR | 14.0 | 16 | female | monthly | 1.0 |
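Since every entry in `borrower_genders` is either "male" or "female", the borrower count equals the number of comma-separated entries, which allows a vectorized sketch (toy values; missing entries default to 1 borrower as above):

```python
import numpy as np
import pandas as pd

s = pd.Series(["female", "female, female", np.nan, "male, female, male"])

# Split each string on ", ", count the entries, and treat NaN as one borrower.
counts = s.str.split(", ").str.len().fillna(1).astype(int)
print(counts.tolist())  # [1, 2, 1, 3]
```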
# find out the memory usage of each column in bytes
df_part4.memory_usage(deep=True)
Index                  5174664
funded_amount          5174664
loan_amount            5174664
activity              44777911
sector                41969385
use                   76126786
country_code          38163147
country               42335536
region                46359200
currency              38809980
term_in_months         5174664
lender_count           5174664
borrower_genders      45391955
repayment_interval    41831080
borrower_count         5174664
dtype: int64
df_part4.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 646833 entries, 0 to 671188
Data columns (total 14 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   funded_amount       646833 non-null  float64
 1   loan_amount         646833 non-null  float64
 2   activity            646833 non-null  object 
 3   sector              646833 non-null  object 
 4   use                 646833 non-null  object 
 5   country_code        646833 non-null  object 
 6   country             646833 non-null  object 
 7   region              646833 non-null  object 
 8   currency            646833 non-null  object 
 9   term_in_months      646833 non-null  float64
 10  lender_count        646833 non-null  int64  
 11  borrower_genders    642945 non-null  object 
 12  repayment_interval  646833 non-null  object 
 13  borrower_count      646833 non-null  float64
dtypes: float64(4), int64(1), object(9)
memory usage: 74.0+ MB
# for further optimization, convert the string-valued columns to the category dtype
list1 = ['activity', 'sector', 'use', 'country_code', 'country', 'region', 'currency', 'borrower_genders', 'repayment_interval']
for i in list1:
    df_part4[i] = df_part4[i].astype('category')
# downcast the numeric columns to the smallest integer type that can hold them
list1 = ['funded_amount', 'loan_amount', 'term_in_months', 'borrower_count']
for i in list1:
    df_part4[i] = pd.to_numeric(df_part4[i], errors='coerce', downcast='integer')
df_part4.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 646833 entries, 0 to 671188
Data columns (total 14 columns):
 #   Column              Non-Null Count   Dtype   
---  ------              --------------   -----   
 0   funded_amount       646833 non-null  int32   
 1   loan_amount         646833 non-null  int32   
 2   activity            646833 non-null  category
 3   sector              646833 non-null  category
 4   use                 646833 non-null  category
 5   country_code        646833 non-null  category
 6   country             646833 non-null  category
 7   region              646833 non-null  category
 8   currency            646833 non-null  category
 9   term_in_months      646833 non-null  int16   
 10  lender_count        646833 non-null  int64   
 11  borrower_genders    642945 non-null  category
 12  repayment_interval  646833 non-null  category
 13  borrower_count      646833 non-null  int8    
dtypes: category(9), int16(1), int32(2), int64(1), int8(1)
memory usage: 50.3 MB
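A small illustration of why the category dtype saves memory for low-cardinality string columns: each distinct value is stored once, and the rows only hold small integer codes.

```python
import pandas as pd

# 4000 strings drawn from just three distinct values.
s = pd.Series(["monthly", "irregular", "bullet", "monthly"] * 1000)

as_obj = s.memory_usage(deep=True)                      # every string stored per row
as_cat = s.astype("category").memory_usage(deep=True)   # 3 strings + int8 codes

print(as_cat < as_obj)  # True
```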
# Create a new column "Success_Rate": the share of the requested loan amount that was actually funded
df_part4["Success_Rate"] = df_part4["funded_amount"] / df_part4["loan_amount"]
df_part4.describe()
| funded_amount | loan_amount | term_in_months | lender_count | borrower_count | Success_Rate | |
|---|---|---|---|---|---|---|
| count | 646833.000000 | 646833.000000 | 646833.000000 | 646833.000000 | 646833.000000 | 646833.000000 |
| mean | 807.845170 | 866.239238 | 13.718920 | 21.110709 | 1.986731 | 0.959461 |
| std | 1145.283451 | 1214.135953 | 8.479172 | 28.837152 | 3.402256 | 0.160691 |
| min | 0.000000 | 25.000000 | 1.000000 | 0.000000 | 1.000000 | 0.000000 |
| 25% | 275.000000 | 275.000000 | 8.000000 | 7.000000 | 1.000000 | 1.000000 |
| 50% | 475.000000 | 500.000000 | 13.000000 | 13.000000 | 1.000000 | 1.000000 |
| 75% | 925.000000 | 1000.000000 | 14.000000 | 25.000000 | 1.000000 | 1.000000 |
| max | 100000.000000 | 100000.000000 | 158.000000 | 2986.000000 | 50.000000 | 1.133333 |
# Where the success ratio is greater than 1, consider it a big success of the project;
# where it is exactly 1, consider the target achieved;
# where it is between 0.9 and 1, consider it near the target;
# where it is below 0.9, consider it failed.
# classify each project accordingly
list1 = []
for i in df_part4["Success_Rate"]:
    if i > 1:
        list1.append("Big Success")
    elif i == 1:
        list1.append("Target achieved")
    elif 0.90 < i < 1:
        list1.append("Near the target")
    else:
        list1.append("Failed")
df_part4['success'] = list1
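The row loop above can be replaced by a vectorized `np.select` with the same thresholds; a sketch on a few sample ratios:

```python
import numpy as np
import pandas as pd

rate = pd.Series([1.06, 1.0, 0.95, 0.5])

# Conditions are evaluated in order; the first match wins, the rest fall
# through to the default label.
conditions = [rate > 1, rate == 1, rate > 0.90]
labels = ["Big Success", "Target achieved", "Near the target"]
success = np.select(conditions, labels, default="Failed")

print(list(success))
# ['Big Success', 'Target achieved', 'Near the target', 'Failed']
```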
# find the projects where the success rate is highest, i.e. funded above the requested amount
df_part4.loc[df_part4.success == "Big Success"]
| funded_amount | loan_amount | activity | sector | use | country_code | country | region | currency | term_in_months | lender_count | borrower_genders | repayment_interval | borrower_count | Success_Rate | success | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 277188 | 425 | 400 | General Store | Retail | to buy beverages, rice, laundry detergent, sug... | MZ | Mozambique | Boane, Maputo | MZN | 17 | 11 | male | monthly | 1 | 1.062500 | Big Success |
| 338159 | 3400 | 3000 | Farm Supplies | Agriculture | to pay for wires for the grape orchard, cover ... | AM | Armenia | Hoktember village, Armavir region | USD | 38 | 84 | male | monthly | 1 | 1.133333 | Big Success |
df_sunburst = df_part4.groupby(["success", "sector", "activity"]).agg({"funded_amount": "mean", "Success_Rate": "mean", "lender_count": "mean"})
df_sunburst.reset_index(inplace=True)
# Create a sunburst chart to find out the distribution of the funded amount
# and the success rate for various sectors and activities
fig1 = px.sunburst(df_sunburst,
path=['success','sector', 'activity'],
values='funded_amount',
color='success',
hover_data=["Success_Rate", "lender_count"],
color_discrete_sequence=["pink", "green", "yellow", "red"],
title="Funded amount, lender count and the success ratio of the crowdfunding of the various sectors and activities")
fig1.update_layout(paper_bgcolor="#fcfae8", font_color="blue")
fig1.show()
# find out the average no. of lenders and borrowers depending on the repayment interval and term in months
df_borrower = df_part4.groupby(["repayment_interval", "term_in_months"]).agg({"lender_count": "mean", "borrower_count": "mean"})
df_borrower.reset_index(inplace=True)
# Create the categories for no. of borrowers
kategorien = ['very few', 'few', 'average', 'many']
max_value = df_borrower['borrower_count'].max()
min_value = df_borrower['borrower_count'].min()
einteilung = [0,2,5,10,max_value]
df_borrower['Borrowers'] = pd.cut(df_borrower['borrower_count'], bins=einteilung, labels=kategorien)
# Create the categories for years
kategorien = ['< 1','between 1 and 3','between 3 and 5','between 5 and 10','more than 10']
max_value = df_borrower['term_in_months'].max()
einteilung = [0,12,36,60,120,max_value]
df_borrower['Years'] = pd.cut(df_borrower['term_in_months'], bins=einteilung, labels=kategorien)
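One detail of the binning above: `pd.cut` builds right-closed intervals by default, so a 12-month term still falls into the first category. A toy illustration with the same style of bins and labels:

```python
import pandas as pd

terms = pd.Series([6, 12, 13, 36])

# Bins (0, 12] and (12, 36]: the right edge belongs to the interval.
cats = pd.cut(terms, bins=[0, 12, 36], labels=["< 1", "between 1 and 3"])
print(cats.tolist())
# ['< 1', '< 1', 'between 1 and 3', 'between 1 and 3']
```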
# Create a plot to find out the average no. of lenders and borrowers depending on the repayment interval
# and term in months
fig = px.bar(df_borrower,
x="Borrowers",
y="lender_count",
facet_col="repayment_interval",
color="Years",
hover_data=['borrower_count','term_in_months'],
title="Average no. of lenders and borrowers depending on the repayment interval and term in months", log_y = True)
fig.update_layout(paper_bgcolor="#fcfae8")
fig.show()
# Read the csv file with the ISO country codes into the DataFrame
df_part5 = pd.read_csv(filepath_or_buffer="iso-country-codes.csv",
                       sep=',',
                       engine='python')  # engine='python', otherwise a warning is raised
df_part5.head()
| English short name lower case | Alpha-2 code | Alpha-3 code | Numeric code | ISO 3166-2 | |
|---|---|---|---|---|---|
| 0 | Afghanistan | AF | AFG | 4 | ISO 3166-2:AF |
| 1 | Åland Islands | AX | ALA | 248 | ISO 3166-2:AX |
| 2 | Albania | AL | ALB | 8 | ISO 3166-2:AL |
| 3 | Algeria | DZ | DZA | 12 | ISO 3166-2:DZ |
| 4 | American Samoa | AS | ASM | 16 | ISO 3166-2:AS |
# Create a new column "iso_alpha" with the iso country codes for each respective country
df_part5.drop(['Numeric code','ISO 3166-2'],axis=1,inplace=True)
df_part5.rename(mapper={'English short name lower case':'Name','Alpha-2 code': 'XX','Alpha-3 code':'XXX'}, axis=1, copy=True, inplace=True)
df_part4['iso_alpha']=df_part4.country_code.map(df_part5.set_index('XX')['XXX'])
df_part4[df_part4['iso_alpha'].isnull()].groupby('country').count()
# Kosovo, South Sudan and Namibia are missing from the lookup, so set their codes via .loc
# (chained indexing like df['iso_alpha'][mask] = ... may silently fail to write through)
df_part4.loc[df_part4['country'] == 'Kosovo', 'iso_alpha'] = 'KOS'
df_part4.loc[df_part4['country'] == 'South Sudan', 'iso_alpha'] = 'SSD'
df_part4.loc[df_part4['country'] == 'Namibia', 'iso_alpha'] = 'NAM'
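The mapping step in miniature (toy codes; "XK" stands in for a code missing from the lookup, and "KOS" mirrors the manual patch used above):

```python
import pandas as pd

codes = pd.Series(["PK", "XK", "IN"])
lookup = pd.Series({"PK": "PAK", "IN": "IND"})

# Series.map looks each value up in the lookup's index; misses become NaN.
alpha3 = codes.map(lookup)
patched = alpha3.fillna("KOS")  # manual patch for the one missing code
print(patched.tolist())  # ['PAK', 'KOS', 'IND']
```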
# aggregate the mean success rate and the total funded amount per country
df_success_rate = df_part4.groupby(["iso_alpha"]).agg({'Success_Rate': 'mean', 'funded_amount': 'sum'})
df_success_rate.sort_values(by="Success_Rate", ascending=False)
# Create a new column "success" from each country's mean success rate
list1 = []
for i in df_success_rate.Success_Rate:
    if i > 1:
        list1.append("Big Success")
    elif i == 1:
        list1.append("Target achieved")
    elif 0.90 < i < 1:
        list1.append("Near the target")
    else:
        list1.append("Failed")
df_success_rate['success'] = list1
df_success_rate.reset_index(inplace=True)
df_success_rate.sort_values(by="funded_amount", ascending=False)
| iso_alpha | Success_Rate | funded_amount | success | |
|---|---|---|---|---|
| 57 | PHL | 0.990799 | 53174650 | Near the target |
| 33 | KEN | 0.957151 | 31813680 | Near the target |
| 56 | PER | 0.987106 | 30387450 | Near the target |
| 59 | PRY | 0.987679 | 29401725 | Near the target |
| 65 | SLV | 0.908408 | 23345975 | Near the target |
| ... | ... | ... | ... | ... |
| 48 | MRT | 1.000000 | 15000 | Target achieved |
| 0 | AFG | 1.000000 | 14000 | Target achieved |
| 81 | VUT | 1.000000 | 9250 | Target achieved |
| 25 | GUM | 0.091860 | 395 | Failed |
| 79 | VIR | 0.000000 | 0 | Failed |
87 rows × 4 columns
# plot the graph to see how much the crowd funding has been successful in each country
px.choropleth(df_success_rate,
locations="iso_alpha",
color="success",
hover_name="iso_alpha",
hover_data=['Success_Rate', 'funded_amount'],
title="Success ratio and funded amount per country",
color_discrete_sequence=["green", "yellow", "red"],
              labels={"Success_Rate": "Success Rate"})
Conclusion